ReadItAndKeep: rapid decontamination of SARS-CoV-2 sequencing reads

Abstract

Summary
Viral sequence data from clinical samples frequently contain contaminating human reads, which must be removed prior to sharing for legal and ethical reasons. To enable host read removal for SARS-CoV-2 sequencing data on low-specification laptops, we developed ReadItAndKeep, a fast lightweight tool for Illumina and nanopore data that only keeps reads matching the SARS-CoV-2 genome. Peak RAM usage is typically below 10 MB, and runtime less than 1 min. We show that by excluding the polyA tail from the viral reference, ReadItAndKeep prevents bleed-through of human reads, whereas mapping to the human genome lets some reads escape. We believe our test approach (including all possible reads from the human genome, human samples from each of the 26 populations in the 1000 genomes data and a diverse set of SARS-CoV-2 genomes) will also be useful for others.

Availability and implementation
ReadItAndKeep is implemented in C++, released under the MIT license, and available from https://github.com/GenomePathogenAnalysisService/read-it-and-keep.

Supplementary information
Supplementary data are available at Bioinformatics online.


Introduction
Since experimental isolation of viral DNA from the host is imperfect, viral sequence data is frequently contaminated with host DNA sequence data. Removal of host sequence is a first step for many analyses, and where the host is human this is essential to safeguard patient anonymity. Typical approaches (Bush et al., 2020) either map reads directly to the host genome [e.g. using BWA MEM (Li, 2013), Bowtie2 (Langmead and Salzberg, 2012)] or use a metagenomics classifier [e.g. Kraken2 (Wood et al., 2019)] to assign each read to a species. However, in some circumstances (such as a global pandemic following a recent zoonosis) the viral genome is known and of limited diversity, opening up the possibility of positively identifying viral reads by mapping to a reference. In this article, we develop a simple tool that scans sequence data and retains only the reads that map to the viral genome. By rigorously testing both theoretically and with human data from diverse global populations predating the pandemic, we are able to give convincing evidence that mapping to a modified SARS-CoV-2 reference is sufficient to guarantee removal of human data. The tool, named ReadItAndKeep, is extremely fast and requires very little RAM: typically a few MB, as compared with around 10 GB for methods based on mapping to the human genome. This allows read decontamination locally on a standard laptop before uploading to a shared or public server for analysis, or depositing in read archives.

Materials and methods
ReadItAndKeep is implemented in C++, using the API of minimap2 to match reads to a target genome. Hits from minimap2 are used without performing full alignment (equivalent to minimap2 default command line options, reporting approximate mappings).
A read is retained if it has a match that is at least 50 bp or is at least 50% of the length of the read (these are default values of user-specifiable parameters). In the case of paired reads, a pair is kept if either read has a suitable match. ReadItAndKeep uses the minimap2 presets 'short read' or 'ont' for Illumina and Oxford Nanopore Technology (ONT) reads respectively (same as command line options -x sr or -x map-ont). Retained reads are written to gzipped FASTQ file(s).
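The retention rule above can be sketched in a few lines. This is a minimal illustration in Python, not ReadItAndKeep's C++ code: the function names `keep_read` and `keep_pair`, and the representation of each hit as a matched length in bp, are assumptions made for the example. The default thresholds mirror the 50 bp and 50% values stated above.

```python
def keep_read(read_length, match_lengths, min_match_bp=50, min_match_pct=50.0):
    """Return True if any match is long enough in bp OR as a percentage of the read.

    match_lengths is a list of matched lengths (bp), one per hit (hypothetical input).
    """
    if read_length <= 0:
        return False
    for match_len in match_lengths:
        if match_len >= min_match_bp:
            return True
        if 100.0 * match_len / read_length >= min_match_pct:
            return True
    return False


def keep_pair(read1, read2, **thresholds):
    """Keep a read pair if either mate passes.

    Each argument is a (read_length, match_lengths) tuple.
    """
    return keep_read(*read1, **thresholds) or keep_read(*read2, **thresholds)
```

Under these assumptions, a 60 bp read with a 30 bp match is kept (the match is 50% of its length), whereas a 150 bp read whose best match is 40 bp is discarded unless its mate has a suitable match.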
We compared ReadItAndKeep with a standard approach of removing reads matching the human genome. We benchmarked against the tool Dehumanizer (https://github.com/SamStudio8/dehumanizer), which wraps mappy/minimap2, with its recommended

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Results
Complete benchmarking results are shown in Supplementary Table S1, summarized in Table 1 and described below.
Evaluation of human read removal: we first checked that ReadItAndKeep should in principle remove human reads, using all 75-mers and 150-mers of the human reference as 'reads', with the target genome SARS-CoV-2 MN908947.3. Only 90 469 (0.003%) 75-mers were retained, all of which matched the 33 bp poly-A tail of MN908947.3. Since this tail provides no useful information and is excluded by SARS-CoV-2 amplicon sequencing, we removed it from the viral genome for all further analysis. Using this trimmed sequence as the target, all tested k-mers were removed by ReadItAndKeep. Dehumanizer retained 0.76% of the 75 bp reads, and 0.03% of the 150 bp reads (Table 1). We then measured the success at human read removal on 27 Illumina runs from the expanded 1000 Genomes Project (Byrska-Bishop et al., 2021): the well-studied sample NA12878 plus one sample from each of the 26 populations, originating from Africa, Asia, Europe and the Americas (Supplementary Table S2). A high-depth run of ONT reads from NA12878 was also tested (Jain et al., 2018). Note that all of these samples were sequenced years before the SARS-CoV-2 virus jumped into humans, and so we assume that all reads in these datasets should be excluded. Across all these samples, ReadItAndKeep retained zero reads, but Dehumanizer kept 1.8% of Illumina and 10% of ONT reads (Table 1). Further investigation of the 10% showed they were heavily enriched for very low quality and repetitive reads, with multi-kb soft-clipped regions.
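The two preparatory steps described above, trimming the poly-A tail from the viral reference and enumerating every k-mer of a reference as a simulated read, can be sketched as follows. This is an illustrative sketch rather than the code used for the benchmark: the function names are hypothetical, and it assumes the tail is exactly the run of trailing A bases at the end of the sequence.

```python
def trim_poly_a_tail(sequence):
    """Remove the run of trailing A bases from a reference sequence.

    Assumption: the poly-A tail is the uninterrupted run of A's at the 3' end.
    """
    return sequence.rstrip("Aa")


def all_kmers(sequence, k):
    """Yield every length-k substring of sequence, for use as a simulated read."""
    for start in range(len(sequence) - k + 1):
        yield sequence[start:start + k]
```

A sequence of length n yields n - k + 1 k-mers (none if n < k), so generating them lazily, as here, avoids holding every simulated read from a large reference in memory at once.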
Quantification of SARS-CoV-2 read retention: we confirmed that all 75-mers and 150-mers from the SARS-CoV-2 reference genome were retained by Dehumanizer and ReadItAndKeep. Next, a set of genetically diverse samples was collated, comprising 246 Illumina and 189 ONT sequencing runs, chosen (see Supplementary Text) to maximize unique protein mutations and ensure a range of lineages as assigned by Pangolin (O'Toole et al., 2021). Dehumanizer retained > 99.99% of reads, and ReadItAndKeep kept > 99.99% of ONT reads and 99.89% of Illumina reads (Table 1). For diagnostic purposes, those reads excluded by ReadItAndKeep were then mapped to the SARS-CoV-2 genome using Bowtie2 (Langmead and Salzberg, 2012) with the --very-sensitive-local option. The excluded reads were highly enriched for low quality, with either a very short match or a high error rate (see Supplementary Fig. S1, Supplementary Table S3). The greatest loss in mean per-base depth was 0.21% for ONT and 1.87% for Illumina (238/246 Illumina samples had mean loss <1%) (Supplementary Table S3). We conclude that this loss of a tiny volume of low quality reads would not affect downstream analyses.
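The summary statistics quoted above can be illustrated with a short sketch. Percent reads retained is pooled by summing reads across all samples in a dataset, rather than averaging per-sample percentages. The depth-loss formula (relative drop in mean per-base depth) is an assumption about how that figure is defined, and both function names are hypothetical.

```python
def pooled_percent_retained(samples):
    """Percent of reads retained, summed across all samples in a dataset.

    samples: iterable of (reads_in, reads_kept) pairs, one per sequencing run.
    """
    samples = list(samples)
    total_in = sum(reads_in for reads_in, _ in samples)
    total_kept = sum(reads_kept for _, reads_kept in samples)
    if total_in == 0:
        raise ValueError("dataset contains no input reads")
    return 100.0 * total_kept / total_in


def percent_depth_loss(depth_before, depth_after):
    """Relative drop (in %) in mean per-base depth for one sample (assumed definition)."""
    if depth_before <= 0:
        raise ValueError("depth_before must be positive")
    return 100.0 * (depth_before - depth_after) / depth_before
```

Pooling weights each sample by its read count, so a small run with an unusually low retention rate has little effect on the dataset-level figure.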

Discussion
There are broadly three options for decontaminating SARS-CoV-2 datasets: exclude reads mapping to human (as done by Dehumanizer), keep reads mapping to the virus (as done by ReadItAndKeep) or do both (first map to the virus, and then exclude any of those reads that also map to human, as is done by the COG consortium). We have shown that, by trimming the poly-A tail from the SARS-CoV-2 genome used by ReadItAndKeep, we completely remove spurious matches of human reads. Thus ReadItAndKeep offers an approach that is more reliable than just mapping to the human genome, and lighter weight (low RAM, fast) than either of the other two approaches.
We also investigated using ReadItAndKeep for Influenza A and HIV-1 samples, which are known to be significantly more diverse than SARS-CoV-2. Although all human reads were removed, the method was not effective in retaining viral reads, in extreme cases rejecting more than half. Therefore, we only recommend ReadItAndKeep for viruses with low levels of diversity; our focus was SARS-CoV-2.
Finally, one challenge for implementing pathogen sequencing in healthcare systems is justifying what proportion of human reads must be removed to guarantee non-identifiability. By explicitly testing with all possible 75 and 150 bp reads in the (extended) human reference genome, and 27 human genome samples from different global ancestries, we were able to show ReadItAndKeep excluded every single human read. We hope the benchmarking approach itself will be of use, and that the speed and low resource requirements will make ReadItAndKeep of wide utility.

Data availability
The data underlying this article are available in the article and in its online supplementary material.

Note (Table 1): Percent reads retained is calculated from summing across reads from all samples in the dataset. Mean run time is the mean wall clock time used across all samples in the dataset.